Applied Metaphors: Learning TRIZ, Complexity, Data/Stats/ML using Metaphors
  1. Tutorial on Correlations in R
  • Teaching
    • Data Viz and Analytics
      • Tools
        • Introduction to R and RStudio
        • Introduction to Radiant
        • Introduction to Orange
      • Descriptive Analytics
        • Data
        • Graphs
        • Summaries
        • Counts
        • Quantities
        • Groups
        • Densities
        • Groups and Densities
        • Change
        • Proportions
        • Parts of a Whole
        • Evolution and Flow
        • Ratings and Rankings
        • Surveys
        • Time
        • Space
        • Networks
        • Experiments
        • Miscellaneous Graphing Tools, and References
      • Statistical Inference
        • ๐Ÿงญ Basics of Statistical Inference
        • ๐ŸŽฒ Samples, Populations, Statistics and Inference
        • Basics of Randomization Tests
        • ๐Ÿƒ Inference for a Single Mean
        • ๐Ÿƒ Inference for Two Independent Means
        • ๐Ÿƒ Inference for Comparing Two Paired Means
        • Comparing Multiple Means with ANOVA
        • Inference for Correlation
        • ๐Ÿƒ Testing a Single Proportion
        • ๐Ÿƒ Inference Test for Two Proportions
      • Inferential Modelling
        • Modelling with Linear Regression
        • Modelling with Logistic Regression
        • ๐Ÿ•” Modelling and Predicting Time Series
      • Predictive Modelling
        • ๐Ÿ‰ Intro to Orange
        • ML - Regression
        • ML - Classification
        • ML - Clustering
      • Prescriptive Modelling
        • ๐Ÿ“ Intro to Linear Programming
        • ๐Ÿ’ญ The Simplex Method - Intuitively
        • ๐Ÿ“… The Simplex Method - In Excel
      • Workflow
        • Facing the Abyss
        • I Publish, therefore I Am
      • Using AI in Analytics
        • Case Studies
          • Demo:Product Packaging and Elderly People
          • Ikea Furniture
          • Movie Profits
          • Gender at the Work Place
          • Heptathlon
          • School Scores
          • Children's Games
          • Valentineโ€™s Day Spending
          • Women Live Longer?
          • Hearing Loss in Children
          • California Transit Payments
          • Seaweed Nutrients
          • Coffee Flavours
          • Legionnaireโ€™s Disease in the USA
          • Antarctic Sea ice
          • William Farr's Observations on Cholera in London
      • TRIZ for Problem Solvers
        • I am Water
        • I am What I yam
        • Birds of Different Feathers
        • I Connect therefore I am
        • I Think, Therefore I am
        • The Art of Parallel Thinking
        • A Year of Metaphoric Thinking
        • TRIZ - Problems and Contradictions
        • TRIZ - The Unreasonable Effectiveness of Available Resources
        • TRIZ - The Ideal Final Result
        • TRIZ - A Contradictory Language
        • TRIZ - The Contradiction Matrix Workflow
        • TRIZ - The Laws of Evolution
        • TRIZ - Substance Field Analysis, and ARIZ
      • Math Models for Creative Coders
        • Maths Basics
          • Vectors
          • Matrix Algebra Whirlwind Tour
          • content/courses/MathModelsDesign/Modules/05-Maths/70-MultiDimensionGeometry/index.qmd
        • Tech
          • Tools and Installation
          • Adding Libraries to p5.js
          • Using Constructor Objects in p5.js
        • Geometry
          • Circles
          • Complex Numbers
          • Fractals
          • Affine Transformation Fractals
          • L-Systems
          • Kolams and Lusona
        • Media
          • Fourier Series
          • Additive Sound Synthesis
          • Making Noise Predictably
          • The Karplus-Strong Guitar Algorithm
        • AI
          • Working with Neural Nets
          • The Perceptron
          • The Multilayer Perceptron
          • MLPs and Backpropagation
          • Gradient Descent
        • Projects
          • Projects
      • Tech for Creative Education
        • ๐Ÿงญ Using Idyll
        • ๐Ÿงญ Using Apparatus
        • ๐Ÿงญ Using g9.js
      • Literary Jukebox: In Short, the World
        • Italy - Dino Buzzati
        • France - Guy de Maupassant
        • Japan - Hisaye Yamamoto
        • Peru - Ventura Garcia Calderon
        • Russia - Maxim Gorky
        • Egypt - Alifa Rifaat
        • Brazil - Clarice Lispector
        • England - V S Pritchett
        • Russia - Ivan Bunin
        • Czechia - Milan Kundera
        • Sweden - Lars Gustaffsson
        • Canada - John Cheever
        • Ireland - William Trevor
        • USA - Raymond Carver
        • Italy - Primo Levi
        • India - Ruth Prawer Jhabvala
        • USA - Carson McCullers
        • Zimbabwe - Petina Gappah
        • India - Bharati Mukherjee
        • USA - Lucia Berlin
        • USA - Grace Paley
        • England - Angela Carter
        • USA - Kurt Vonnegut
        • Spain-Merce Rodoreda
        • Israel - Ruth Calderon
        • Israel - Etgar Keret
    • Posts
    • Blogs and Talks

    On this page

    • Introduction
    • Setting up R Packages
    • Case Study 1: Galton Dataset from mosaicData
      • Inspecting the Data
      • Correlations and Plots
      • Visualizing Uncertainty in Correlation Estimates
    • Case Study#2: Dataset from NHANES
      • Inspecting the Data
      • Correlations and Plots
      • Visualizing Uncertainty in Correlation Estimates
    • Conclusion
    • References

    Tutorial on Correlations in R

    Author

    Arvind Venkatadri

    Published

    November 22, 2022

    Modified

    July 29, 2025

    Abstract
    Tests, Tables, and Graphs for Correlations in R

    Introduction

    We will create Tables for Correlations, and graphs for Correlations in R. As always, we will consistently use the Project Mosaic ecosystem of packages in R (mosaic, mosaicData and ggformula).

    Setting up R Packages

    library(tidyverse)
    library(mosaic) # package for stats, simulations, and basic plots
    library(ggformula) # package for professional looking plots, that use the formula interface from mosaic
    library(skimr)
    library(GGally)
    library(corrplot) # For Correlogram plots
    library(broom) # to properly format stat test results
    
    library(mosaicData) # package containing datasets
    library(NHANES) # survey data collected by the US National Center for Health Statistics (NCHS)

    All R functions seen in the code are clickable links that take you to online documentation about the function. Try!
    TipThe Formula interface

    Note the standard method for all commands from the mosaic package:

    goal( y ~ x | z, data = mydata, โ€ฆ)

    With ggformula, one can create any graph/chart using:

    gf_geometry(y ~ x | z, data = mydata)

    OR

    mydata %>% gf_geometry( y ~ x | z )

    The second method may be preferable, especially if you have done some data manipulation first! More about this later!

    Case Study 1: Galton Dataset from mosaicData

    Let us inspect what datasets are available in the package mosaicData. Run this command in your Console:

    # Run in Console
    data(package = "mosaicData")

    The popup tab shows a lot of datasets we could use. Let us continue to use the famous Galton dataset and inspect it:

    data("Galton")

    Inspecting the Data

    The inspect command already gives us a series of statistical measures of different variables of interest. As discussed previously, we can retain the output of inspect and use it in our reports: (there are ways of dressing up these tables too)

    galton_describe <- inspect(Galton)
    
    galton_describe$categorical
    galton_describe$quantitative

    Try help("Galton") in your Console. The dataset is described as:

    A data frame with 898 observations on the following variables.
    - family a factor with levels for each family
    - father the fatherโ€™s height (in inches)
    - mother the motherโ€™s height (in inches)
    - sex the childโ€™s sex: F or M
    - height the childโ€™s height as an adult (in inches)
    - nkids the number of adult children in the family, or, at least, the number whose heights Galton recorded.

    There is a lot of Description generated by the mosaic::inspect() command ! Let us also look at the output of skim:

    skimr::skim(Galton)
    Data summary
    Name Galton
    Number of rows 898
    Number of columns 6
    _______________________
    Column type frequency:
    factor 2
    numeric 4
    ________________________
    Group variables None

    Variable type: factor

    skim_variable n_missing complete_rate ordered n_unique top_counts
    family 0 1 FALSE 197 185: 15, 166: 11, 66: 11, 130: 10
    sex 0 1 FALSE 2 M: 465, F: 433

    Variable type: numeric

    skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
    father 0 1 69.23 2.47 62 68 69.0 71.0 78.5 โ–โ–…โ–‡โ–‚โ–
    mother 0 1 64.08 2.31 58 63 64.0 65.5 70.5 โ–‚โ–…โ–‡โ–ƒโ–
    height 0 1 66.76 3.58 56 64 66.5 69.7 79.0 โ–โ–‡โ–‡โ–…โ–
    nkids 0 1 6.14 2.69 1 4 6.0 8.0 15.0 โ–ƒโ–‡โ–†โ–‚โ–

    What can we say about the dataset and its variables? How big is the dataset? How many variables? What types are they, Quant or Qual? If they are Qual, what are the levels? Are they ordered levels? Which variables could have relationships with others? Why? Write down these Questions!

    Correlations and Plots

    What Questions might we have, that we could answer with a Statistical Measure, or Correlation chart?

    NoteQuestions

    How does childrenโ€™s height correlate with that of father and mother? Is this relationship also affected by sex of the child?

    With this question, height becomes our target variable, which we should always plot on the dependent y-axis.

    # Pulling out the list of Quant variables from NHANES
    galton_quant <- galton_describe$quantitative
    galton_quant$name
    [1] "father" "mother" "height" "nkids" 
    GGally::ggpairs(
      Galton,
    
      # Choose the variables we want to plot for
      columns = c("father", "mother", "height", "nkids"),
      switch = "both", # axis labels in more traditional locations
      progress = FALSE, # no compute progress messages needed
    
      # Choose the diagonal graphs (always single variable! Think!)
      diag = list(continuous = "barDiag"), # choosing histogram,not density
    
      # Choose lower triangle graphs, two-variable graphs
      lower = list(continuous = wrap("smooth", alpha = 0.1)),
      title = "Galton Data Correlations Plot"
    ) +
    
      theme_bw()

    We note that childrenโ€™s height is correlated with that of father and mother. The correlations are both positive, and that with father seems to be the larger of the two. ( Look at the slopes of the lines and the values of the correlation scores. )

    NoteQuestion

    What if we group the Quant variables based on a Qual variable, like sex of the child?

    # Pulling out the list of Quant variables from NHANES
    galton_quant <- galton_describe$quantitative
    galton_quant$name
    [1] "father" "mother" "height" "nkids" 
    GGally::ggpairs(
      Galton,
      mapping = aes(colour = sex), # Colour by `sex`
    
      # Choose the variables we want to plot for
      columns = c("father", "mother", "height", "nkids"),
      switch = "both", # axis labels in more traditional locations
      progress = FALSE, # no compute progress messages needed
    
      diag = list(continuous = "barDiag"),
    
      # Choose lower triangle graphs, two-variable graphs
      lower = list(continuous = wrap("smooth", alpha = 0.1)),
      title = "Galton Data Correlations Plot"
    ) +
    
      theme_bw()

    The split scatter plots are useful, as is the split histogram for height: Clearly the correlation of childrenโ€™s height with father and mother is positive for both sex-es. The other plots, and even some of the correlations scores are not all useful! Just shows everything we can compute is not necessarily useful immediately.

    In later modules we will see how to plot correlations when the number of variables is larger still.

    NoteQuestion

    Can we plot a Correlogram for this dataset?

    # library(corrplot)
    
    galton_num_var <- Galton %>% select(father, mother, height, nkids)
    galton_cor <- cor(galton_num_var)
    galton_cor %>%
      corrplot(
        method = "ellipse",
        type = "lower",
        main = "Correlogram for Galton dataset"
      )

    Clearly height is positively correlated to father and mother; interestingly, height is negatively correlated ( slightly) with nkids.

    NoteQuestion

    Let us confirm with a correlation test:

    We will use the mosaic function cor_test to get these results:

    mosaic::cor_test(height ~ father, data = Galton) %>%
      broom::tidy() %>%
      knitr::kable(
        digits = 2,
        caption = "Children vs Fathers"
      )
    Children vs Fathers
    estimate statistic p.value parameter conf.low conf.high method alternative
    0.28 8.57 0 896 0.21 0.33 Pearsonโ€™s product-moment correlation two.sided
    mosaic::cor_test(height ~ mother, data = Galton) %>%
      broom::tidy() %>%
      knitr::kable(
        digits = 2,
        caption = "Children vs Mothers"
      )
    Children vs Mothers
    estimate statistic p.value parameter conf.low conf.high method alternative
    0.2 6.16 0 896 0.14 0.26 Pearsonโ€™s product-moment correlation two.sided
    ImportantCorrelation Scores and Uncertainty

    Note how the mosaic::cor_test() reports a correlation score estimate and the p-value for the same. There is also a confidence interval reported for the correlation score, an interval within which we are 95% sure that the true correlation value is to be found.

    Note that GGally::ggpairs() too reports the significance of the correlation scores estimates using *** or **. This indicates the p-value in the scores obtained by GGally; Presumably, there is an internal cor_test that is run for each pair of variables and the p-value and confidence levels are also computed internally.

    In both cases, we used the formula \(height \sim other-variable\), in keeping with our idea of height being the dependent, target variable..

    We also see the p.value for the estimateed correlation is negligible, and the conf.low/conf.high interval does not straddle \(0\). These attest to the significance of the correlation score.

    NoteQuestion

    What does this correlation look when split by sex of Child?

    # For the sons
    
    mosaic::cor_test(height ~ father,
      data = Galton %>% filter(sex == "M")
    ) %>%
      broom::tidy() %>%
      knitr::kable(
        digits = 2,
        caption = "Sons vs Fathers"
      )
    cor_test(height ~ mother,
      data = Galton %>% filter(sex == "M")
    ) %>%
      broom::tidy() %>%
      knitr::kable(
        digits = 2,
        caption = "Sons vs Mothers"
      )
    
    # For the daughters
    cor_test(height ~ father,
      data = Galton %>% filter(sex == "F")
    ) %>%
      broom::tidy() %>%
      knitr::kable(
        digits = 2,
        caption = "Daughters vs Fathers"
      )
    cor_test(height ~ mother,
      data = Galton %>% filter(sex == "F")
    ) %>%
      broom::tidy() %>%
      knitr::kable(
        digits = 2,
        caption = "Daughters vs Mothers"
      )
    Sons vs Fathers
    estimate statistic p.value parameter conf.low conf.high method alternative
    0.39 9.15 0 463 0.31 0.47 Pearsonโ€™s product-moment correlation two.sided
    Sons vs Mothers
    estimate statistic p.value parameter conf.low conf.high method alternative
    0.33 7.63 0 463 0.25 0.41 Pearsonโ€™s product-moment correlation two.sided
    Daughters vs Fathers
    estimate statistic p.value parameter conf.low conf.high method alternative
    0.46 10.72 0 431 0.38 0.53 Pearsonโ€™s product-moment correlation two.sided
    Daughters vs Mothers
    estimate statistic p.value parameter conf.low conf.high method alternative
    0.31 6.86 0 431 0.23 0.4 Pearsonโ€™s product-moment correlation two.sided

    The same observation as made above ( p.value and confidence intervals) applies here too and tells us that the estimated correlations are significant.

    Visualizing Uncertainty in Correlation Estimates

    We can also visualize this uncertainty and the confidence levels in a plot too, using gf_errorbar and a handy set of functions within purrr which is part of the tidyverse. Assuming heights is the target variable we want to correlate every other (quantitative) variable against, we can proceed very quickly as follows: we will first plot correlation uncertainty for one pair of variables to develop the intuition, and then for all variables against the one target variable:

    mosaic::cor_test(height ~ mother, data = Galton) %>%
      broom::tidy() %>%
      # We need a graph not a table
      # So comment out this line from the earlier code
      # knitr::kable(digits = 2,caption = "Children vs Mothers")
    
      rowid_to_column(var = "index") %>% # Need an index to plot with
    
      # Uncertainty as error-bars
      gf_errorbar(conf.high + conf.low ~ index, linewidth = 2) %>%
      # Estimate as a point
      gf_point(estimate ~ index, color = "red", size = 6) %>%
      # Labels
      gf_text(estimate ~ index - 0.2,
        label = "Correlation Score = estimate"
      ) %>%
      gf_text(conf.high * 0.98 ~ index - 0.25,
        label = "Upper Limit = estimate + conf.high"
      ) %>%
      gf_text(conf.low * 1.04 ~ index - 0.25,
        label = "Lower Limit = estimate - conf.low"
      ) %>%
      gf_theme(theme_bw())

    We can now do this for all variables against the target variable height, which we identified in our research question. We will use the iteration capabilities offered by the tidyverse package, purrr:

    all_corrs <- Galton %>%
      select(where(is.numeric)) %>%
      # leave off height to get all the remaining ones
      select(-height) %>%
      # perform a cor.test for all variables against height
      purrr::map(
        .x = .,
        .f = \(x) cor.test(x, Galton$height)
      ) %>%
      # tidy up the cor.test outputs into a tidy data frame
      map_dfr(broom::tidy, .id = "predictor")
    
    all_corrs
    all_corrs %>%
      # arrange the predictors in order of their correlation scores
      # with the target variable (`height`)
      # Add errorbars to show uncertainty ranges / confidence intervals
      # Use errorbar width and linewidth fo emphasis
      gf_errorbar(conf.high + conf.low ~ reorder(predictor, estimate),
        color = ~estimate,
        width = 0.2,
        linewidth = ~ -log10(p.value)
      ) %>%
      # All correlation estimates as points
      gf_point(estimate ~ reorder(predictor, estimate),
        color = "black"
      ) %>%
      # Reference line at zero correlation score
      gf_hline(yintercept = 0, color = "grey", linewidth = 2) %>%
      # Themes,Titles, and Scales
      gf_labs(
        x = NULL, y = "Correlation with height in Galton",
        caption = "Significance = - log10(p.value)"
      ) %>%
      gf_refine(
    
        # Scale for colour
        scale_colour_distiller("Correlation", type = "div", palette = "RdBu"),
    
        # Scale for dumbbells!!
        scale_linewidth_continuous("significance",
          range = c(0.5, 4)
        )
      ) %>%
      gf_refine(guides(linewidth = guide_legend(reverse = TRUE))) %>%
      gf_theme(theme_classic())

    We can clearly see the size of the correlations and the confidence intervals marked in this plot. father has somewhat greater correlation with childrenโ€™s height, as compared to mother. nkids seems to matter very little. This kind of plot will be very useful when we pursue linear regression models.

    NoteQuestion

    How can we show this correlation in a set of Scatter Plots + Regression Lines? Can we recreate Galtonโ€™s famous diagram?

    # For the sons
    gf_point(height ~ father,
      data = Galton %>% filter(sex == "M"),
      title = "Soms and Fathers"
    ) %>%
      gf_smooth(method = "lm") %>%
      gf_theme(theme_minimal())
    gf_point(height ~ mother,
      data = Galton %>% filter(sex == "M"),
      title = "Sons and Mothers"
    ) %>%
      gf_smooth(method = "lm") %>%
      gf_theme(theme_minimal())
    # For the daughters
    gf_point(height ~ father,
      data = Galton %>% filter(sex == "F"),
      title = "Daughters and Fathers"
    ) %>%
      gf_smooth(method = "lm") %>%
      gf_theme(theme_minimal())
    gf_point(height ~ mother,
      data = Galton %>% filter(sex == "F"),
      title = "Daughters and Mothers"
    ) %>%
      gf_smooth(method = "lm") %>%
      gf_theme(theme_minimal())

    An approximation to Galtonโ€™s famous plot1 (see Wikipedia):

    gf_point(height ~ (father + mother) / 2, data = Galton) %>%
      gf_smooth(method = "lm") %>%
      gf_density_2d(n = 8) %>%
      gf_abline(slope = 1) %>%
      gf_theme(theme_minimal())
    gf_point(height ~ (father + mother) / 2, data = Galton) %>%
      gf_smooth(method = "lm") %>%
      gf_ellipse(level = 0.95, color = "red") %>%
      gf_ellipse(level = 0.75, color = "blue") %>%
      gf_ellipse(level = 0.5, color = "green") %>%
      gf_abline(slope = 1) %>%
      gf_theme(theme_minimal())

    How would you interpret this plot2?

    Case Study#2: Dataset from NHANES

    Let us look at the NHANES dataset from the package NHANES:

    data("NHANES")

    Inspecting the Data

    NHANES_describe <- inspect(NHANES)
    
    NHANES_describe$categorical
    NHANES_describe$quantitative
    NHANES
    skimr::skim(NHANES)
    Data summary
    Name NHANES
    Number of rows 10000
    Number of columns 76
    _______________________
    Column type frequency:
    factor 31
    numeric 45
    ________________________
    Group variables None

    Variable type: factor

    skim_variable n_missing complete_rate ordered n_unique top_counts
    SurveyYr 0 1.00 FALSE 2 200: 5000, 201: 5000
    Gender 0 1.00 FALSE 2 fem: 5020, mal: 4980
    AgeDecade 333 0.97 FALSE 8 40: 1398, 0-: 1391, 10: 1374, 20: 1356
    Race1 0 1.00 FALSE 5 Whi: 6372, Bla: 1197, Mex: 1015, Oth: 806
    Race3 5000 0.50 FALSE 6 Whi: 3135, Bla: 589, Mex: 480, His: 350
    Education 2779 0.72 FALSE 5 Som: 2267, Col: 2098, Hig: 1517, 9 -: 888
    MaritalStatus 2769 0.72 FALSE 6 Mar: 3945, Nev: 1380, Div: 707, Liv: 560
    HHIncome 811 0.92 FALSE 12 mor: 2220, 750: 1084, 250: 958, 350: 863
    HomeOwn 63 0.99 FALSE 3 Own: 6425, Ren: 3287, Oth: 225
    Work 2229 0.78 FALSE 3 Wor: 4613, Not: 2847, Loo: 311
    BMICatUnder20yrs 8726 0.13 FALSE 4 Nor: 805, Obe: 221, Ove: 193, Und: 55
    BMI_WHO 397 0.96 FALSE 4 18.: 2911, 30.: 2751, 25.: 2664, 12.: 1277
    Diabetes 142 0.99 FALSE 2 No: 9098, Yes: 760
    HealthGen 2461 0.75 FALSE 5 Goo: 2956, Vgo: 2508, Fai: 1010, Exc: 878
    LittleInterest 3333 0.67 FALSE 3 Non: 5103, Sev: 1130, Mos: 434
    Depressed 3327 0.67 FALSE 3 Non: 5246, Sev: 1009, Mos: 418
    SleepTrouble 2228 0.78 FALSE 2 No: 5799, Yes: 1973
    PhysActive 1674 0.83 FALSE 2 Yes: 4649, No: 3677
    TVHrsDay 5141 0.49 FALSE 7 2_h: 1275, 1_h: 884, 3_h: 836, 0_t: 638
    CompHrsDay 5137 0.49 FALSE 7 0_t: 1409, 0_h: 1073, 1_h: 1030, 2_h: 589
    Alcohol12PlusYr 3420 0.66 FALSE 2 Yes: 5212, No: 1368
    SmokeNow 6789 0.32 FALSE 2 No: 1745, Yes: 1466
    Smoke100 2765 0.72 FALSE 2 No: 4024, Yes: 3211
    Smoke100n 2765 0.72 FALSE 2 Non: 4024, Smo: 3211
    Marijuana 5059 0.49 FALSE 2 Yes: 2892, No: 2049
    RegularMarij 5059 0.49 FALSE 2 No: 3575, Yes: 1366
    HardDrugs 4235 0.58 FALSE 2 No: 4700, Yes: 1065
    SexEver 4233 0.58 FALSE 2 Yes: 5544, No: 223
    SameSex 4232 0.58 FALSE 2 No: 5353, Yes: 415
    SexOrientation 5158 0.48 FALSE 3 Het: 4638, Bis: 119, Hom: 85
    PregnantNow 8304 0.17 FALSE 3 No: 1573, Yes: 72, Unk: 51

    Variable type: numeric

    skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
    ID 0 1.00 61944.64 5871.17 51624.00 56904.50 62159.50 67039.00 71915.00 โ–‡โ–‡โ–‡โ–‡โ–‡
    Age 0 1.00 36.74 22.40 0.00 17.00 36.00 54.00 80.00 โ–‡โ–‡โ–‡โ–†โ–…
    AgeMonths 5038 0.50 420.12 259.04 0.00 199.00 418.00 624.00 959.00 โ–‡โ–‡โ–‡โ–†โ–ƒ
    HHIncomeMid 811 0.92 57206.17 33020.28 2500.00 30000.00 50000.00 87500.00 100000.00 โ–ƒโ–†โ–ƒโ–โ–‡
    Poverty 726 0.93 2.80 1.68 0.00 1.24 2.70 4.71 5.00 โ–…โ–…โ–ƒโ–ƒโ–‡
    HomeRooms 69 0.99 6.25 2.28 1.00 5.00 6.00 8.00 13.00 โ–‚โ–†โ–‡โ–‚โ–
    Weight 78 0.99 70.98 29.13 2.80 56.10 72.70 88.90 230.70 โ–‚โ–‡โ–‚โ–โ–
    Length 9457 0.05 85.02 13.71 47.10 75.70 87.00 96.10 112.20 โ–โ–ƒโ–†โ–‡โ–ƒ
    HeadCirc 9912 0.01 41.18 2.31 34.20 39.58 41.45 42.92 45.40 โ–โ–‚โ–‡โ–‡โ–…
    Height 353 0.96 161.88 20.19 83.60 156.80 166.00 174.50 200.40 โ–โ–โ–โ–‡โ–‚
    BMI 366 0.96 26.66 7.38 12.88 21.58 25.98 30.89 81.25 โ–‡โ–†โ–โ–โ–
    Pulse 1437 0.86 73.56 12.16 40.00 64.00 72.00 82.00 136.00 โ–‚โ–‡โ–ƒโ–โ–
    BPSysAve 1449 0.86 118.15 17.25 76.00 106.00 116.00 127.00 226.00 โ–ƒโ–‡โ–‚โ–โ–
    BPDiaAve 1449 0.86 67.48 14.35 0.00 61.00 69.00 76.00 116.00 โ–โ–โ–‡โ–‡โ–
    BPSys1 1763 0.82 119.09 17.50 72.00 106.00 116.00 128.00 232.00 โ–‚โ–‡โ–‚โ–โ–
    BPDia1 1763 0.82 68.28 13.78 0.00 62.00 70.00 76.00 118.00 โ–โ–โ–‡โ–†โ–
    BPSys2 1647 0.84 118.48 17.49 76.00 106.00 116.00 128.00 226.00 โ–ƒโ–‡โ–‚โ–โ–
    BPDia2 1647 0.84 67.66 14.42 0.00 60.00 68.00 76.00 118.00 โ–โ–โ–‡โ–†โ–
    BPSys3 1635 0.84 117.93 17.18 76.00 106.00 116.00 126.00 226.00 โ–ƒโ–‡โ–‚โ–โ–
    BPDia3 1635 0.84 67.30 14.96 0.00 60.00 68.00 76.00 116.00 โ–โ–โ–‡โ–‡โ–
    Testosterone 5874 0.41 197.90 226.50 0.25 17.70 43.82 362.41 1795.60 โ–‡โ–‚โ–โ–โ–
    DirectChol 1526 0.85 1.36 0.40 0.39 1.09 1.29 1.58 4.03 โ–…โ–‡โ–‚โ–โ–
    TotChol 1526 0.85 4.88 1.08 1.53 4.11 4.78 5.53 13.65 โ–‚โ–‡โ–โ–โ–
    UrineVol1 987 0.90 118.52 90.34 0.00 50.00 94.00 164.00 510.00 โ–‡โ–…โ–‚โ–โ–
    UrineFlow1 1603 0.84 0.98 0.95 0.00 0.40 0.70 1.22 17.17 โ–‡โ–โ–โ–โ–
    UrineVol2 8522 0.15 119.68 90.16 0.00 52.00 95.00 171.75 409.00 โ–‡โ–†โ–ƒโ–‚โ–
    UrineFlow2 8524 0.15 1.15 1.07 0.00 0.48 0.76 1.51 13.69 โ–‡โ–โ–โ–โ–
    DiabetesAge 9371 0.06 48.42 15.68 1.00 40.00 50.00 58.00 80.00 โ–โ–‚โ–†โ–‡โ–‚
    DaysPhysHlthBad 2468 0.75 3.33 7.40 0.00 0.00 0.00 3.00 30.00 โ–‡โ–โ–โ–โ–
    DaysMentHlthBad 2466 0.75 4.13 7.83 0.00 0.00 0.00 4.00 30.00 โ–‡โ–โ–โ–โ–
    nPregnancies 7396 0.26 3.03 1.80 1.00 2.00 3.00 4.00 32.00 โ–‡โ–โ–โ–โ–
    nBabies 7584 0.24 2.46 1.32 0.00 2.00 2.00 3.00 12.00 โ–‡โ–…โ–โ–โ–
    Age1stBaby 8116 0.19 22.65 4.77 14.00 19.00 22.00 26.00 39.00 โ–†โ–‡โ–…โ–‚โ–
    SleepHrsNight 2245 0.78 6.93 1.35 2.00 6.00 7.00 8.00 12.00 โ–โ–…โ–‡โ–โ–
    PhysActiveDays 5337 0.47 3.74 1.84 1.00 2.00 3.00 5.00 7.00 โ–‡โ–‡โ–ƒโ–…โ–…
    TVHrsDayChild 9347 0.07 1.94 1.43 0.00 1.00 2.00 3.00 6.00 โ–‡โ–†โ–‚โ–‚โ–‚
    CompHrsDayChild 9347 0.07 2.20 2.52 0.00 0.00 1.00 6.00 6.00 โ–‡โ–โ–โ–โ–ƒ
    AlcoholDay 5086 0.49 2.91 3.18 1.00 1.00 2.00 3.00 82.00 โ–‡โ–โ–โ–โ–
    AlcoholYear 4078 0.59 75.10 103.03 0.00 3.00 24.00 104.00 364.00 โ–‡โ–โ–โ–โ–
    SmokeAge 6920 0.31 17.83 5.33 6.00 15.00 17.00 19.00 72.00 โ–‡โ–‚โ–โ–โ–
    AgeFirstMarij 7109 0.29 17.02 3.90 1.00 15.00 16.00 19.00 48.00 โ–โ–‡โ–‚โ–โ–
    AgeRegMarij 8634 0.14 17.69 4.81 5.00 15.00 17.00 19.00 52.00 โ–‚โ–‡โ–โ–โ–
    SexAge 4460 0.55 17.43 3.72 9.00 15.00 17.00 19.00 50.00 โ–‡โ–…โ–โ–โ–
    SexNumPartnLife 4275 0.57 15.09 57.85 0.00 2.00 5.00 12.00 2000.00 โ–‡โ–โ–โ–โ–
    SexNumPartYear 5072 0.49 1.34 2.78 0.00 1.00 1.00 1.00 69.00 โ–‡โ–โ–โ–โ–

    Try help("NHANES") in your Console.

    This is survey data collected by the US National Center for Health Statistics (NCHS) which has conducted a series of health and nutrition surveys since the early 1960โ€™s. Since 1999 approximately 5,000 individuals of all ages are interviewed in their homes every year and complete the health examination component of the survey. The health examination is conducted in a mobile examination centre (MEC).

    The dataset is described as: A data frame with 100000 observations on 76 variables. Some of these are:
    - Race1 and Race2: factors with 5 and 6 levels respectively
    - Education a factor with 5 levels
    - HHIncomeMid Total annual gross income for the household in US dollars.
    - Age
    - BMI: Body mass index (weight/height2 in kg/m2)
    - Height: Standing height in cm.
    - Weight: Weight in kg > > - Testosterone: Testosterone total (ng/dL) - PhysActiveDays: Number of days in a typical week that participant does moderate or vigorous-intensity activity.
    - CompHrsDay: Number of hours per day on average participant used a computer or gaming device over the past 30 days.

    ImportantMissing Data

    Why do so many of the variables have missing entries? What could be your guess about the Experiment/Survey`?

    Let us make some counts of the data, since we have so many factors:

    NHANES %>% count(Gender)
    NHANES %>% count(Race1)
    NHANES %>% count(Race3)
    NHANES %>% count(Education)
    NHANES %>% count(MaritalStatus)

    There is a good mix of factors and counts.

    Now we articulate our Research Questions:

    NoteResearch Questions
    1. Does Testosterone have a relationship with parameters such as BMI, Weight, Height, PhysActiveDays CompHrsDay and Age?

    2. Does HHIncomeMid have a relationship with these same parameters? And with Gender?

    3. Are there any other pairwise correlations that we should note? (This is especially useful in choosing independent variables for multiple regression)

    ( Yes we are concerned with men more than with the women, sadly.)

    Correlations and Plots

    GGally::ggpairs(NHANES,
      # Choose the variables we want to plot for
      columns = c(
        "HHIncomeMid", "Weight", "Height",
        "BMI", "Gender"
      ),
    
      # LISTs of graphs needed at different locations
      # For different combinations of variables
      diag = list(continuous = "barDiag"),
      lower = list(continuous = wrap("smooth", alpha = 0.01)),
      upper = list(continuous = "cor"),
      switch = "both", # axis labels in more traditional locations
      progress = FALSE
    ) + # No compute progress bars needed
      theme_bw()

    We see that HHIncomeMid is Quantitative, discrete valued variable, since it is based on a set of median incomes for different ranges of income. BMI, Weight, Height are continuous Quant variables.

    HHIncomeMid also seems to be relatively unaffected by Weight; And is only mildly correlated with Height and BMI, as seen both by the correlation score magnitudes and the slopes of the trend lines.

    There is a difference in the median income by Gender, but we will defer that kind of test for later, when we do Statistical Inference.

    Unsurprisingly, BMI and Weight have a strong relationship, as do Height and Weight; the latter is of course non-linear, since the Height levels off at a point.

    GGally::ggpairs(NHANES,
      columns = c("Testosterone", "Weight", "Height", "BMI"),
      diag = list(continuous = "barDiag"),
      lower = list(continuous = wrap("smooth", alpha = 0.01)),
      upper = list(continuous = "cor"),
      switch = "both",
      progress = FALSE
    ) +
      theme_bw()

    It is clear that Testosterone has strong relationships with Height and Weight but not so much with BMI.

    Visualizing Uncertainty in Correlation Estimates

    Since the pairs plot is fairly clear for both target variables, let us head to visualizing the significance and uncertainty in the correlation estimates.

    HHIncome_corrs <- NHANES %>%
      select(where(is.numeric)) %>%
      # leave off height to get all the remaining ones
      select(-HHIncomeMid) %>%
      # perform a cor.test for all variables against height
      purrr::map(
        .x = .,
        .f = \(x) cor.test(x, NHANES$HHIncomeMid)
      ) %>%
      # tidy up the cor.test outputs into a tidy data frame
      map_dfr(broom::tidy, .id = "predictor")
    
    HHIncome_corrs
    HHIncome_corrs %>%
      # Reference line at zero correlation score
      gf_hline(yintercept = 0, color = "grey", linewidth = 2) %>%
      # arrange the predictors in order of their correlation scores
      # with the target variable (`height`)
      # Add errorbars to show uncertainty ranges / confidence intervals
      # Use errorbar width and linewidth fo emphasis
      gf_errorbar(conf.high + conf.low ~ reorder(predictor, estimate),
        color = ~estimate,
        width = 0.2,
        linewidth = ~ -log10(p.value + 0.001)
      ) %>%
      # All correlation estimates as points
      gf_point(estimate ~ reorder(predictor, estimate),
        color = "black"
      ) %>%
      # Themes,Titles, and Scales
      gf_labs(
        x = NULL, y = "Correlations with HouseHold Median Income",
        caption = "Significance = - log10(p.value)"
      ) %>%
      gf_theme(theme_classic()) %>%
      # Scale for colour
      gf_refine(
        guides(linewidth = guide_legend(reverse = TRUE)),
        scale_colour_distiller("Correlation",
          type = "div",
          palette = "RdBu"
        ),
    
        # Scale for dumbbells!!
        scale_linewidth_continuous("Significance", range = c(0.05, 2)),
        theme(axis.text.y = element_text(size = 6, hjust = 1)),
        coord_flip()
      )

    If we select just the variables from our Research Question:

    HHIncome_corrs_select <- NHANES %>%
      select(Height, Weight, BMI) %>% # Only change is here!
    
      # leave off height to get all the remaining ones
      # select(- HHIncomeMid) %>%
    
      # perform a cor.test for all variables against height
      purrr::map(
        .x = .,
        .f = \(x) cor.test(x, NHANES$HHIncomeMid)
      ) %>%
      # tidy up the cor.test outputs into a tidy data frame
      map_dfr(broom::tidy, .id = "predictor")
    
    HHIncome_corrs_select
    HHIncome_corrs_select %>%
      # arrange the predictors in order of their correlation scores
      # with the target variable (`height`)
      # Add errorbars to show uncertainty ranges / confidence intervals
      # Use errorbar width and linewidth fo emphasis
      gf_errorbar(conf.high + conf.low ~ reorder(predictor, estimate),
        color = ~estimate,
        width = 0.2,
        linewidth = ~ -log10(p.value + 0.000001)
      ) %>%
      # All correlation estimates as points
      gf_point(estimate ~ reorder(predictor, estimate),
        color = "black"
      ) %>%
      # Reference line at zero correlation score
      gf_hline(yintercept = 0, color = "grey", linewidth = 2) %>%
      # Themes,Titles, and Scales
      gf_labs(
        x = NULL, y = "Correlations with HouseHold Median Income",
        caption = "Significance = - log10(p.value + 0.000001)"
      ) %>%
      gf_theme(theme_classic()) %>%
      # Scale for colour
      gf_refine(
        guides(linewidth = guide_legend(reverse = TRUE)),
        scale_colour_distiller("Correlation",
          type = "div",
          palette = "RdBu"
        ),
    
        # Scale for dumbbells!!
        scale_linewidth_continuous("Significance", range = c(0.05, 2)),
        theme(axis.text.y = element_text(size = 8, hjust = 1)),
        coord_flip()
      )

    So we might say taller people make more money? And fatter people make slightly less money? Well, the magnitude of the correlations (aka effect size) are low so we would not imagine this to be a hypothesis that we can defend.

    Let us look at the Testosterone variable: trying all variables shows some paucity of observations ( due to missing data), so we will stick with our chosen variables:

    Testosterone_corrs <- NHANES %>%
      select(Height, Weight, BMI) %>%
      # leave off height to get all the remaining ones
      # select(- Testosterone) %>%
    
      # perform a cor.test for all variables against height
      purrr::map(
        .x = .,
        .f = \(x) cor.test(x, NHANES$Testosterone)
      ) %>%
      # tidy up the cor.test outputs into a tidy data frame
      map_dfr(broom::tidy, .id = "predictor")
    
    Testosterone_corrs
    Testosterone_corrs %>%
      # Reference line at zero correlation score
      gf_hline(
        yintercept = 0,
        color = "grey",
        linewidth = 2
      ) %>%
      # arrange the predictors in order of their correlation scores
      # with the target variable (`height`)
      # Add errorbars to show uncertainty ranges / confidence intervals
      # Use errorbar width and linewidth fo emphasis
      gf_errorbar(
        conf.high + conf.low ~ reorder(predictor, estimate),
        color = ~estimate,
        width = 0.2,
        linewidth = ~ -log10(p.value + 0.000001)
      ) %>%
      # All correlation estimates as points
      gf_point(estimate ~ reorder(predictor, estimate),
        color = "black"
      ) %>%
      # Themes,Titles, and Scales
      gf_labs(
        x = NULL, y = "Correlations with Testosterone Levels",
        caption = "Significance = - log10(p.value + 0.000001)"
      ) %>%
      gf_theme(theme_classic()) %>%
      # Scale for colour
      gf_refine(
        guides(linewidth = guide_legend(reverse = TRUE)),
        scale_colour_distiller("Correlation",
          type = "div",
          palette = "RdBu"
        ),
    
        # Scale for dumbbells!!
        scale_linewidth_continuous("Significance", range = c(0.05, 2)),
        theme(axis.text.y = element_text(size = 8, hjust = 1)),
        coord_flip()
      )

    Conclusion

    We have a decent Correlations related workflow in R:

    • Load the dataset
    • inspect/skim/glimpse the dataset, identify Quant and Qual variables
    • Identify a target variable based on your knowledge of the data, how it was gathered, who gathered it and what was their intent
    • Develop Pair-Wise plots + Correlations using GGally::ggpairs()
    • Develop Correlogram corrplot::corrplot
    • Check everything with a cor_test: effect size,significance, confidence intervals
    • Use purrr + cor.test to plot correlations and confidence intervals for multiple Quant predictor variables against the target variable
    • Plot scatter plots using gf_point.
    • Add extra lines using gf_abline() to compare hypotheses that you may have.

    References

    R Package Citations
    Package Version Citation
    corrplot 0.95 Wei and Simko (2024)
    GGally 2.2.1 Schloerke et al. (2024)
    ggformula 0.12.0 Kaplan and Pruim (2023)
    mosaic 1.9.1 Pruim, Kaplan, and Horton (2017)
    mosaicData 0.20.4 Pruim, Kaplan, and Horton (2023)
    NHANES 2.1.0 Pruim (2015)
    TeachHist 0.2.1 Lange (2023)
    TeachingDemos 2.13 Snow (2024)
    Kaplan, Daniel, and Randall Pruim. 2023. ggformula: Formula Interface to the Grammar of Graphics. https://doi.org/10.32614/CRAN.package.ggformula.
    Lange, Carsten. 2023. TeachHist: A Collection of Amended Histograms Designed for Teaching Statistics. https://doi.org/10.32614/CRAN.package.TeachHist.
    Pruim, Randall. 2015. NHANES: Data from the US National Health and Nutrition Examination Study. https://doi.org/10.32614/CRAN.package.NHANES.
    Pruim, Randall, Daniel T Kaplan, and Nicholas J Horton. 2017. โ€œThe Mosaic Package: Helping Students to โ€˜Think with Dataโ€™ Using r.โ€ The R Journal 9 (1): 77โ€“102. https://journal.r-project.org/archive/2017/RJ-2017-024/index.html.
    Pruim, Randall, Daniel Kaplan, and Nicholas Horton. 2023. mosaicData: Project MOSAIC Data Sets. https://doi.org/10.32614/CRAN.package.mosaicData.
    Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2024. GGally: Extension to โ€œggplot2โ€. https://doi.org/10.32614/CRAN.package.GGally.
    Snow, Greg. 2024. TeachingDemos: Demonstrations for Teaching and Learning. https://doi.org/10.32614/CRAN.package.TeachingDemos.
    Wei, Taiyun, and Viliam Simko. 2024. R Package โ€œcorrplotโ€: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
    Back to top

    Footnotes

    1. http://euclid.psych.yorku.ca/SCS/Gallery/images/galton-corr.jpg>โ†ฉ๏ธŽ

    2. https://www.researchgate.net/figure/Galtons-smoothed-correlation-diagram-for-the-data-on-heights-of-parents-and-children_fig15_226400313โ†ฉ๏ธŽ

    Citation

    BibTeX citation:
    @online{venkatadri2022,
      author = {Venkatadri, Arvind},
      title = {Tutorial on {Correlations} in {R}},
      date = {2022-11-22},
      url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/30-Correlations/files/correlations.html},
      langid = {en},
      abstract = {Tests, Tables, and Graphs for Correlations in R}
    }
    
    For attribution, please cite this work as:
    Venkatadri, Arvind. 2022. โ€œTutorial on Correlations in R.โ€ November 22, 2022. https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/30-Correlations/files/correlations.html.

    License: CC BY-SA 2.0

    Website made with โค๏ธ and Quarto, by Arvind V.

    Hosted by Netlify .